Computation and Language
☆ RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo
Effective collaboration of dual-arm robots and their tool use capabilities
are increasingly important areas in the advancement of robotics. These skills
play a significant role in expanding robots' ability to operate in diverse
real-world environments. However, progress is impeded by the scarcity of
specialized training data. This paper introduces RoboTwin, a novel benchmark
dataset combining real-world teleoperated data with synthetic data from digital
twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform,
we have collected diverse data on tool usage and human-robot interaction. We
present an innovative approach to creating digital twins using AI-generated
content, transforming 2D images into detailed 3D models. Furthermore, we
utilize large language models to generate expert-level training data and
task-specific pose sequences oriented toward functionality. Our key
contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient
real-to-simulation pipeline, and 3) the use of language models for automatic
expert-level data generation. These advancements are designed to address the
shortage of robotic training data, potentially accelerating the development of
more capable and versatile robotic systems for a wide range of real-world
applications. The project page is available at
https://robotwin-benchmark.github.io/early-version/
comment: Project page: https://robotwin-benchmark.github.io/early-version/
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for
generative modeling of discrete data, thanks to their superior performance over
other discrete diffusion models, and are rivaling the auto-regressive models
(ARMs) for language modeling tasks. The recent effort in simplifying the masked
diffusion framework further leads to alignment with continuous-space diffusion
models and more principled training and sampling recipes. In this paper,
however, we reveal that both training and sampling of MDMs are theoretically
free from the time variable, arguably the key signature of diffusion models,
and are instead equivalent to masked models. The connection on the sampling
side is established by our proposed first-hitting sampler (FHS). Specifically, we
show that the FHS is theoretically equivalent to MDMs' original generation
process while significantly alleviating the time-consuming categorical sampling
and achieving a 20$\times$ speedup. In addition, our investigation challenges
previous claims that MDMs can surpass ARMs in generative perplexity. We
identify, for the first time, an underlying numerical issue, even with the
32-bit floating-point precision, which results in inaccurate categorical
sampling. We show that the numerical issue lowers the effective temperature
both theoretically and empirically, leading to unfair assessments of MDMs'
generation results in the previous literature.
comment: 40 pages
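The inaccurate categorical sampling described above can be illustrated with a small sketch (an illustration of the mechanism only, not the paper's experiment): Gumbel-max categorical sampling run at float32 clips the uniform noise to a coarser range than float64, the kind of precision loss the authors tie to a lowered effective temperature. The logits and sample counts below are toy values.

```python
import numpy as np

def gumbel_categorical(logits, n_samples, dtype, rng):
    """Sample category indices via the Gumbel-max trick at a given float precision."""
    logits = logits.astype(dtype)
    u = rng.random((n_samples, logits.shape[0])).astype(dtype)
    # Clip to the representable open interval (0, 1); at float32 this
    # discards tail mass that float64 would still resolve.
    u = np.clip(u, np.finfo(dtype).tiny, dtype(1.0) - np.finfo(dtype).eps)
    gumbel = -np.log(-np.log(u))
    return np.argmax(logits + gumbel, axis=-1)

rng = np.random.default_rng(0)
logits = np.array([5.0, 0.0, -10.0])  # one dominant token, one rare token
probs = np.exp(logits) / np.exp(logits).sum()

samples32 = gumbel_categorical(logits, 100_000, np.float32, rng)
samples64 = gumbel_categorical(logits, 100_000, np.float64, rng)
freq32 = np.bincount(samples32, minlength=3) / len(samples32)
freq64 = np.bincount(samples64, minlength=3) / len(samples64)
print("target:", probs, "float32:", freq32, "float64:", freq64)
```

In this toy setting both precisions track the target distribution closely; the paper's point is that at realistic vocabulary sizes and peaked distributions, float32 rounding systematically distorts the sampled frequencies.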
☆ LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Though current long-context large language models (LLMs) have demonstrated
impressive capacities in answering user questions based on extensive text, the
lack of citations in their responses makes user verification difficult, leading
to concerns about their trustworthiness due to their potential hallucinations.
In this work, we aim to enable long-context LLMs to generate responses with
fine-grained sentence-level citations, improving their faithfulness and
verifiability. We first introduce LongBench-Cite, an automated benchmark for
assessing current LLMs' performance in Long-Context Question Answering with
Citations (LQAC), revealing considerable room for improvement. To address this,
we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf
LLMs to automatically generate long-context QA instances with precise
sentence-level citations, and we leverage this pipeline to construct
LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B
and LongCite-9B using the
LongCite-45k dataset, successfully enabling their generation of accurate
responses and fine-grained sentence-level citations in a single output. The
evaluation results on LongBench-Cite show that our trained models achieve
state-of-the-art citation quality, surpassing advanced proprietary models
including GPT-4o.
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language
Models~(MLLMs) is crucial for video understanding, high-resolution image
understanding, and multi-modal agents. This involves a series of systematic
optimizations, including model architecture, data construction and training
strategy, particularly addressing challenges such as \textit{degraded
performance with more images} and \textit{high computational costs}. In this
paper, we adapt the model architecture to a hybrid of Mamba and Transformer
blocks, approach data construction with both temporal and spatial dependencies
among multiple images and employ a progressive training strategy. The released
model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge
\textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first
hybrid MLLM, achieving a better balance between efficiency and
effectiveness. LongLLaVA not only achieves competitive results across various
benchmarks, but also maintains high throughput and low memory consumption.
Notably, it can process nearly a thousand images on a single A100 80GB
GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun
Advancements in LLMs have recently unveiled challenges tied to computational
efficiency and continual scalability due to their huge parameter counts,
making it increasingly cumbersome to deploy and evolve these models on devices
with limited computation resources and in scenarios requiring diverse
abilities. Inspired by modularity within the human brain, there
is a growing tendency to decompose LLMs into numerous functional modules,
allowing for inference with part of modules and dynamic assembly of modules to
tackle complex tasks, such as mixture-of-experts. To highlight the inherent
efficiency and composability of the modular approach, we coin the term brick to
represent each functional module, designating the modularized structure as
configurable foundation models. In this paper, we offer a comprehensive
overview and investigation of the construction, utilization, and limitation of
configurable foundation models. We first formalize modules into emergent bricks
- functional neuron partitions that emerge during the pre-training phase, and
customized bricks - bricks constructed via additional post-training to improve
the capabilities and knowledge of LLMs. Based on diverse functional bricks, we
further present four brick-oriented operations: retrieval and routing, merging,
updating, and growing. These operations allow for dynamic configuration of LLMs
based on instructions to handle complex tasks. To verify our perspective, we
conduct an empirical analysis on widely-used LLMs. We find that the FFN layers
follow modular patterns with functional specialization of neurons and
functional neuron partitions. Finally, we highlight several open issues and
directions for future research. Overall, this paper aims to offer a fresh
modular perspective on existing LLM research and inspire the future creation of
more efficient and scalable foundational models.
☆ Historical German Text Normalization Using Type- and Token-Based Language Modeling
Historical variations in spelling pose a challenge for full-text search and
natural language processing on digitized historical texts. To minimize the gap
between historical orthography and contemporary spelling, an automatic
orthographic normalization of the historical source material is usually
pursued. This report proposes a normalization system for German literary texts
from c. 1700-1900, trained on a parallel corpus. The proposed system makes use
of a machine learning approach using Transformer language models, combining an
encoder-decoder model to normalize individual word types, and a pre-trained
causal language model to adjust these normalizations within their context. An
extensive evaluation shows that the proposed system provides state-of-the-art
accuracy, comparable with a much larger fully end-to-end sentence-based
normalization system, fine-tuning a pre-trained Transformer large language
model. However, the normalization of historical text remains a challenge due to
difficulties for models to generalize, and the lack of extensive high-quality
parallel data.
comment: 27 pages, 3 figures
☆ R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education
In this article, we propose the R2GQA system, a Retriever-Reader-Generator
Question Answering system, consisting of three main components: Document
Retriever, Machine Reader, and Answer Generator. The Retriever module employs
advanced information retrieval techniques to extract the context of articles
from a dataset of legal regulation documents. The Machine Reader module
utilizes state-of-the-art natural language understanding algorithms to
comprehend the retrieved documents and extract answers. Finally, the Generator
module synthesizes the extracted answers into concise and informative responses
to students' questions regarding legal regulations. Furthermore, we built the
ViRHE4QA dataset in the domain of university training regulations, comprising
9,758 question-answer pairs with a rigorous construction process. This is the
first Vietnamese dataset in the higher education regulations domain with various types of
answers, both extractive and abstractive. In addition, the R2GQA system is the
first system to offer abstractive answers in Vietnamese. This paper discusses
the design and implementation of each module within the R2GQA system on the
ViRHE4QA dataset, highlighting their functionalities and interactions.
Furthermore, we present experimental results demonstrating the effectiveness
and utility of the proposed system in supporting students' comprehension of
legal regulations in higher education settings. In general, the R2GQA system
and the ViRHE4QA dataset promise to contribute significantly to related
research and help students navigate complex legal documents and regulations,
empowering them to make informed decisions and adhere to institutional policies
effectively. Our dataset is available for research purposes.
☆ Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study analyzes predictive statements, hope speech, and regret
detection within cryptocurrency-related discussions,
leveraging advanced natural language processing techniques. We introduce a
novel classification scheme named "Prediction statements," categorizing
comments into Predictive Incremental, Predictive Decremental, Predictive
Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large
language model, we explore sentiment dynamics across five prominent
cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis
reveals distinct patterns in predictive sentiments, with Matic demonstrating a
notably higher propensity for optimistic predictions. Additionally, we
investigate hope and regret sentiments, uncovering nuanced interplay between
these emotions and predictive behaviors. Despite encountering limitations
related to data volume and resource availability, our study reports valuable
discoveries concerning investor behavior and sentiment trends within the
cryptocurrency market, informing strategic decision-making and future research
endeavors.
☆ CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He
Large language models (LLMs) have obtained promising results in mathematical
reasoning, which is a foundational skill for human intelligence. Most previous
studies focus on improving and measuring the performance of LLMs based on
textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few
researchers have released English multimodal math datasets (e.g., MATHVISTA and
MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In
this paper, we release a Chinese multimodal math (CMM-Math) dataset, including
benchmark and training parts, to evaluate and enhance the mathematical
reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples,
featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank,
and so on) with detailed solutions across 12 grade levels from elementary to
high school in China. Specifically, the visual context may be present in the
questions or options, which makes this dataset more challenging. Through
comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math
dataset face challenges, emphasizing the necessity for further improvements in
LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to
handle the problems with mixed input of multiple images and text segments. We
train our model using three stages, including foundational pre-training,
foundational fine-tuning, and mathematical fine-tuning. The extensive
experiments indicate that our model effectively improves math reasoning
performance by comparing it with the SOTA LMMs over three multimodal
mathematical datasets.
☆ MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
This paper introduces MMMU-Pro, a robust version of the Massive
Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and
reasoning capabilities through a three-step process based on MMMU: (1)
filtering out questions answerable by text-only models, (2) augmenting
candidate options, and (3) introducing a vision-only input setting where
questions are embedded within images. This setting challenges AI to truly "see"
and "read" simultaneously, testing a fundamental human cognitive skill of
seamlessly integrating visual and textual information. Results show that model
performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8%
to 26.9% across models. We explore the impact of OCR prompts and Chain of
Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT
generally improves performance. MMMU-Pro provides a more rigorous evaluation
tool, closely mimicking real-world scenarios and offering valuable directions
for future research in multimodal AI.
☆ Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of
the crucial factors to achieve success is aligning the LLM's output with human
preferences. This alignment process often requires only a small amount of data
to efficiently enhance the LLM's performance. While effective, research in this
area spans multiple domains, and the methods involved are relatively complex to
understand. The relationships between different methods have been
under-explored, limiting the development of preference alignment. In light
of this, we break down the existing popular alignment strategies into different
components and provide a unified framework to study the current alignment
strategies, thereby establishing connections among them. In this survey, we
decompose all the strategies in preference learning into four components:
model, data, feedback, and algorithm. This unified view offers an in-depth
understanding of existing alignment algorithms and also opens up possibilities
to synergize the strengths of different strategies. Furthermore, we present
detailed working examples of prevalent existing algorithms to facilitate a
comprehensive understanding for the readers. Finally, based on our unified
perspective, we explore the challenges and future research directions for
aligning large language models with human preferences.
comment: Initial Commit, 21 pages
☆ A Comparative Study of Pre-training and Self-training
Pre-training and self-training are two approaches to semi-supervised
learning. The comparison between pre-training and self-training has been
explored before, but previous works reached conflicting findings: self-training
outperforms pre-training on some computer vision tasks, whereas pre-training
outperforms self-training on some natural language processing tasks, and these
comparisons were made under incomparable settings. We propose an ensemble
method to comparatively and exhaustively study, within consistent foundational
settings, all feasible training paradigms combining pre-training,
self-training, fine-tuning, and data augmentation. We conduct experiments on
six datasets, four data augmentation methods,
and imbalanced data for sentiment analysis and natural language inference
tasks. Our findings confirm that the pre-training and fine-tuning paradigm
yields the best overall performances. Moreover, self-training offers no
additional benefits when combined with semi-supervised pre-training.
comment: 19 pages, 2 figures, 9 tables
☆ Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative
tasks have led to a growing body of work exploring LLM-based embedding models.
While these models, employing different pooling and attention strategies, have
achieved state-of-the-art performance on public embedding benchmarks, questions
still arise about what constitutes an effective design for LLM-based embedding
models. However, these models are often trained on different datasets, using
different LLM base models or training settings. Moreover, evaluations on public
embedding benchmarks often fail to report statistical significance, making it
difficult to determine which designs truly contribute to final performance.
This complicates the process for practitioners seeking optimal training recipes
for LLM-based embedding models. In this study, we conduct a large-scale
experiment by training a series of LLM-based embedding models using the same
training data and base model but differing in their pooling and attention
strategies. The results show that there is no one-size-fits-all solution: while
bidirectional attention and an additional trainable pooling layer outperform in
text similarity and information retrieval tasks, they do not significantly
surpass simpler designs like EOS-last token pooling and default causal
attention in clustering and classification tasks. Furthermore, we propose a new
pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs
of all hidden layers, rather than just the last layer, using a cross-attention
network. This method proves to be statistically superior in text similarity and
retrieval tasks compared to existing pooling methods. Overall, this paper sheds
light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
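The pooling strategies compared above can be sketched in a few lines. The following is a minimal illustration of EOS/last-token pooling versus mask-aware mean pooling over toy hidden states (not the paper's models or its trainable cross-attention pooling):

```python
import numpy as np

def last_token_pool(hidden, mask):
    """EOS/last-token pooling: take the hidden state of the final non-padded token."""
    lengths = mask.sum(axis=1) - 1  # index of last real token per sequence
    return hidden[np.arange(hidden.shape[0]), lengths]

def mean_pool(hidden, mask):
    """Mask-aware mean pooling over all non-padded positions."""
    m = mask[:, :, None]
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

# Toy batch: 2 sequences, max length 3, hidden size 2; second sequence has 1 pad.
hidden = np.array([[[1., 1.], [2., 2.], [3., 3.]],
                   [[4., 4.], [6., 6.], [0., 0.]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
print(last_token_pool(hidden, mask))  # -> [[3. 3.] [6. 6.]]
print(mean_pool(hidden, mask))        # -> [[2. 2.] [5. 5.]]
```

With causal attention, only the last token has attended to the whole sequence, which is why EOS-last pooling is the default pairing for causal models, while mean pooling is more natural under bidirectional attention.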
☆ Pre-training data selection for biomedical domain adaptation using journal impact metrics
Domain adaptation is a widely used method in natural language processing
(NLP) to improve the performance of a language model within a specific domain.
This method is particularly common in the biomedical domain, which sees regular
publication of numerous scientific articles. PubMed, a significant corpus of
text, is frequently used in the biomedical domain. The primary objective of
this study is to explore whether refining a pre-training dataset using specific
quality metrics for scientific papers can enhance the performance of the
resulting model. To accomplish this, we employ two straightforward journal
impact metrics and conduct experiments by continually pre-training BERT on
various subsets of the complete PubMed training set. We then evaluate the
resulting models on biomedical language understanding tasks from the BLURB
benchmark. Our results show that pruning using journal impact metrics is not
effective. However, we also show that pre-training using fewer abstracts (but with
the same number of training steps) does not necessarily decrease the resulting
model's performance.
☆ Alignment-Aware Model Extraction Attacks on Large Language Models
Model extraction attacks (MEAs) on large language models (LLMs) have received
increasing research attention lately. Existing attack methods on LLMs inherit
the extraction strategies from those designed for deep neural networks (DNNs)
yet neglect the inconsistency of training tasks between MEA and LLMs'
alignments. As such, they result in poor attack performance. To tackle this
issue, we present Locality Reinforced Distillation (LoRD), a novel model
extraction attack algorithm specifically for LLMs. In particular, we design a
policy-gradient-style training task, which utilizes victim models' responses as
a signal to guide the crafting of preference for the local model. Theoretical
analysis has shown that i) LoRD's convergence procedure in MEAs is consistent
with the alignments of LLMs, and ii) LoRD can reduce query complexity while
mitigating watermark protection through exploration-based stealing. Extensive
experiments on domain-specific extractions demonstrate the superiority of our
method by examining the extraction of various state-of-the-art commercial LLMs.
comment: Source code: https://github.com/liangzid/alignmentExtraction
☆ A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Machine translation in low-resource language pairs faces significant
challenges due to the scarcity of parallel corpora and linguistic resources.
This study focuses on the case of English-Marathi language pairs, where
existing datasets are notably noisy, impeding the performance of machine
translation models. To mitigate the impact of data quality issues, we propose a
data filtering approach based on cross-lingual sentence representations. Our
methodology leverages a multilingual SBERT model to filter out problematic
translations in the training data. Specifically, we employ an IndicSBERT
similarity model to assess the semantic equivalence between original and
translated sentences, allowing us to retain linguistically correct translations
while discarding instances with substantial deviations. The results demonstrate
a significant improvement in translation quality over the baseline
post-filtering with IndicSBERT. This illustrates how cross-lingual sentence
representations can reduce errors in machine translation scenarios with limited
resources. By integrating multilingual sentence BERT models into the
translation pipeline, this research contributes to advancing machine
translation techniques in low-resource environments. The proposed method not
only addresses the challenges in English-Marathi language pairs but also
provides a valuable framework for enhancing translation quality in other
low-resource language translation tasks.
comment: Accepted at I2CT 2024
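The filtering step can be sketched with precomputed embeddings. The toy vectors below stand in for IndicSBERT outputs, and the similarity threshold is an assumed hyperparameter, not a value from the paper:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, src_emb, tgt_emb, threshold=0.7):
    """Keep sentence pairs whose cross-lingual embeddings are semantically close."""
    return [pair for pair, s, t in zip(pairs, src_emb, tgt_emb)
            if cosine(s, t) >= threshold]

pairs = [("good pair", "gut"), ("noisy pair", "??")]
src = np.array([[1.0, 0.0], [1.0, 0.0]])
tgt = np.array([[0.9, 0.1], [0.0, 1.0]])  # second pair points elsewhere
print(filter_pairs(pairs, src, tgt))       # -> [('good pair', 'gut')]
```

In practice the source and target sentences would each be encoded with the same multilingual model so that translations of each other land near one another in the shared embedding space.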
☆ Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram
This study investigates the automated classification of Calls to Action
(CTAs) within the 2021 German Instagram election campaign to advance the
understanding of mobilization in social media contexts. We analyzed over 2,208
Instagram stories and 712 posts using fine-tuned BERT models and OpenAI's GPT-4
models. The fine-tuned BERT model incorporating synthetic training data
achieved a macro F1 score of 0.93, demonstrating a robust classification
performance. Our analysis revealed that 49.58% of Instagram posts and 10.64% of
stories contained CTAs, highlighting significant differences in mobilization
strategies between these content types. Additionally, we found that FDP and the
Greens had the highest prevalence of CTAs in posts, whereas CDU and CSU led in
story CTAs.
comment: Accepted Archival Paper for the CPSS Workshop at KONVENS 2024. Camera
Ready Submission
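The macro F1 reported above averages per-class F1 scores without weighting by class frequency, which matters when CTAs are rare (as in stories). A minimal reference implementation with toy labels (not the study's data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy binary CTA labels
y_true = ["cta", "cta", "no_cta", "no_cta"]
y_pred = ["cta", "no_cta", "no_cta", "no_cta"]
print(round(macro_f1(y_true, y_pred), 4))  # -> 0.7333
```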
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in
tackling various tasks based on human instructions, but recent studies reveal
that these models often fail to achieve satisfactory results on questions
involving reasoning, such as mathematics or physics questions. This phenomenon
is usually attributed to the uncertainty regarding whether these models could
genuinely comprehend the knowledge embedded in the text or merely learn to
replicate the token distribution without a true understanding of the content.
In this paper, we delve into this problem and aim to enhance the reasoning
capabilities of LLMs. First, we investigate if the model has genuine reasoning
capabilities by visualizing the text generation process at the attention and
representation level. Then, we formulate the reasoning process of LLMs into a
causal framework, which provides a formal explanation of the problems we
observe in the visualization. Finally, building upon this causal framework, we
propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient
fine-tuning (PEFT) method to enhance the model's reasoning capabilities by
encouraging the model to extract the general problem-solving skills and apply
these skills to different questions. Experiments show that our method
outperforms the baseline consistently across multiple benchmarks, and with only
1.2M tunable parameters, we achieve better or comparable results to other
fine-tuning methods. This demonstrates the effectiveness and efficiency of our
method in improving the overall accuracy and reliability of LLMs.
☆ Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus
This article investigates how translation memories (TM) can be created by
translators or other language professionals in order to compile domain-specific
parallel corpora, which can then be used in different scenarios, such as
machine translation training and fine-tuning, TM leveraging, and/or large
language model fine-tuning. The article introduces a semi-automatic TM
preparation methodology that primarily leverages the translation tools
translators already use, favoring data quality and translator control. This
semi-automatic methodology is then used to build a cardiology-based Turkish ->
English corpus from bilingual abstracts of Turkish cardiology journals. The
resulting corpus called TRENCARD Corpus has approximately 800,000 source words
and 50,000 sentences. Using this methodology, translators can build their
custom TMs in a reasonable time and use them in tasks requiring bilingual
data.
☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
Włodzimierz Lewoniewski, Piotr Stolarski, Milena Stróżyna, Elzbieta Lewańska, Aleksandra Wojewoda, Ewelina Księżniak, Marcin Sawiński
This paper presents the experiments and results for the CheckThat! Lab at
CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial
Examples (InCrediblAE). The primary objective of this task was to generate
adversarial examples in five problem domains in order to evaluate the
robustness of widely used text classification methods (fine-tuned BERT, BiLSTM,
and RoBERTa) when applied to credibility assessment issues.
This study explores the application of ensemble learning to enhance
adversarial attacks on natural language processing (NLP) models. We
systematically tested and refined several adversarial attack methods, including
BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across
various misinformation tasks. By developing modified versions of BERT-Attack
and hybrid methods, we achieved significant improvements in attack
effectiveness. Our results demonstrate the potential of modification and
combining multiple methods to create more sophisticated and effective
adversarial attack strategies, contributing to the development of more robust
and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
☆ A Survey on Emergent Language
Jannik Peters, Constantin Waubert de Puiseau, Hasan Tercan, Arya Gopikrishnan, Gustavo Adolpho Lucas De Carvalho, Christian Bitter, Tobias Meisen
The field of emergent language represents a novel area of research within the
domain of artificial intelligence, particularly within the context of
multi-agent reinforcement learning. Although the concept of studying language
emergence is not new, early approaches were primarily concerned with explaining
human language formation, with little consideration given to its potential
utility for artificial agents. In contrast, studies based on reinforcement
learning aim to develop communicative capabilities in agents that are
comparable to or even superior to human language. Thus, they extend beyond the
learned statistical representations that are common in natural language
processing research. This gives rise to a number of fundamental questions, from
the prerequisites for language emergence to the criteria for measuring its
success. This paper addresses these questions by providing a comprehensive
review of 181 scientific publications on emergent language in artificial
intelligence. Its objective is to serve as a reference for researchers
interested in or proficient in the field. Consequently, the main contributions
are the definition and overview of the prevailing terminology, the analysis of
existing evaluation methods and metrics, and the description of the identified
research gaps.
☆ PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation
The ability of large language models (LLMs) to interpret visual
representations of data is crucial for advancing their application in data
analysis and decision-making processes. This paper presents a novel synthetic
dataset designed to evaluate the proficiency of LLMs in interpreting various
forms of data visualizations, including time series plots, histograms, violin
plots, box plots, and cluster plots. Our dataset is generated using controlled
parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in
images to benchmark several state-of-the-art models like ChatGPT or Gemini,
assessing their understanding and interpretative accuracy.
To ensure data integrity, our benchmark dataset is generated automatically,
making it entirely new and free from prior exposure to the models being tested.
This strategy allows us to evaluate the models' ability to truly interpret and
understand the data, eliminating the possibility of pre-learned responses and
allowing for an unbiased evaluation of their capabilities. We also
introduce quantitative metrics to assess the performance of the models,
providing a robust and comprehensive evaluation tool.
Benchmarking several state-of-the-art LLMs with this dataset reveals varying
degrees of success, highlighting specific strengths and weaknesses in
interpreting diverse types of visual data. The results provide valuable
insights into the current capabilities of LLMs and identify key areas for
improvement. This work establishes a foundational benchmark for future research
and development aimed at enhancing the visual interpretative abilities of
language models. In the future, improved LLMs with robust visual interpretation
skills can significantly aid in automated data analysis, scientific research,
educational tools, and business intelligence applications.
☆ An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Self-Supervised Learning (SSL) has proven to be effective in various domains,
including speech processing. However, SSL is computationally and memory
expensive. This is in part due to the quadratic complexity of multi-head
self-attention (MHSA). Alternatives for MHSA have been proposed and used in the
speech domain, but have yet to be investigated properly in an SSL setting. In
this work, we study the effects of replacing MHSA with recent state-of-the-art
alternatives that have linear complexity, namely, HyperMixing, Fastformer,
SummaryMixing, and Mamba. We evaluate these methods by looking at the speed,
the amount of VRAM consumed, and the performance on the SSL MP3S benchmark.
Results show that these linear alternatives maintain competitive performance
compared to MHSA while, on average, decreasing VRAM consumption by around 20%
to 60% and increasing speed by 7% to 65% for input sequences ranging from 20
to 80 seconds.
comment: Accepted at the IEEE Spoken Language Technology Workshop 2024
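The complexity gap the study measures can be illustrated with a toy sketch.
This is not any of the benchmarked architectures; `summary_mix` is a simplified
stand-in for the shared-summary idea behind methods like SummaryMixing, where
each token interacts with one summary vector instead of all other tokens:

```python
def mhsa_cost(n, d):
    # Multi-head self-attention compares every token pair, so compute
    # scales quadratically with sequence length n.
    return n * n * d

def summary_mix(tokens):
    # Linear-complexity mixing: combine each token with one shared
    # summary vector (here, the mean over tokens) instead of pairwise
    # attention scores -- a single pass over the sequence, so cost
    # grows linearly with its length.
    d = len(tokens[0])
    summary = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(d)]
    return [[tok[j] + summary[j] for j in range(d)] for tok in tokens]
```

Doubling the sequence length doubles the work of `summary_mix` but quadruples
the pairwise comparisons counted by `mhsa_cost`, which is why the linear
alternatives save VRAM and time on long inputs.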
☆ More is More: Addition Bias in Large Language Models
In this paper, we investigate the presence of additive bias in Large Language
Models (LLMs), drawing a parallel to the cognitive bias observed in humans
where individuals tend to favor additive over subtractive changes. Using a
series of controlled experiments, we tested various LLMs, including GPT-3.5
Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks
designed to measure their propensity for additive versus subtractive
modifications. Our findings demonstrate a significant preference for additive
changes across all tested models. For example, in a palindrome creation task,
Llama 3.1 favored adding letters 97.85% of the time over removing them.
Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick
76.38% of the time rather than remove one. In a text summarization task,
Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to
improve its own or others' writing. These results indicate that, similar to
humans, LLMs exhibit a marked additive bias, which might have implications when
LLMs are used on a large scale. Additive bias might increase resource use and
environmental impact, leading to higher economic costs due to overconsumption
and waste. This bias should be considered in the development and application of
LLMs to ensure balanced and efficient problem-solving approaches.
comment: 25 pages, 8 figures
☆ Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts
We propose misogyny detection as an Argumentative Reasoning task and we
investigate the capacity of large language models (LLMs) to understand the
implicit reasoning used to convey misogyny in both Italian and English. The
central aim is to generate the missing reasoning link between a message and the
implied meanings encoding the misogyny. Our study uses argumentation theory as
a foundation to form a collection of prompts in both zero-shot and few-shot
settings. These prompts integrate different techniques, including
chain-of-thought reasoning and augmented knowledge. Our findings show that LLMs
fall short on reasoning capabilities about misogynistic comments and that they
mostly rely on their implicit knowledge derived from internalized common
stereotypes about women to generate implied assumptions, rather than on
inductive reasoning.
☆ Word and Phrase Features in Graph Convolutional Network for Automatic Question Classification
Effective question classification is crucial for AI-driven educational tools,
enabling adaptive learning systems to categorize questions by skill area,
difficulty level, and competence. This classification not only supports
educational diagnostics and analytics but also enhances complex tasks like
information retrieval and question answering by associating questions with
relevant categories. Traditional methods, often based on word embeddings and
conventional classifiers, struggle to capture the nuanced relationships in
natural language, leading to suboptimal performance. To address this, we
propose a novel approach leveraging graph convolutional networks (GCNs), named
Phrase Question-Graph Convolutional Network (PQ-GCN) to better model the
inherent structure of questions. By representing questions as graphs -- where
nodes signify words or phrases and edges denote syntactic or semantic
relationships -- our method allows GCNs to learn from the interconnected nature
of language more effectively. Additionally, we explore the incorporation of
phrase-based features to enhance classification accuracy, especially in
low-resource settings. Our findings demonstrate that GCNs, augmented with these
features, offer a promising solution for more accurate and context-aware
question classification, bridging the gap between graph neural network research
and practical educational applications.
☆ A Comparative Study on Large Language Models for Log Parsing
Background: Log messages provide valuable information about the status of
software systems. This information is provided in an unstructured fashion and
automated approaches are applied to extract relevant parameters. To ease this
process, log parsing can be applied, which transforms log messages into
structured log templates. Recent advances in language models have led to
several studies that apply ChatGPT to the task of log parsing with promising
results. However, the performance of other state-of-the-art large language
models (LLMs) on the log parsing task remains unclear.
Aims: In this study, we investigate the current capability of
state-of-the-art LLMs to perform log parsing.
Method: We select six recent LLMs: two paid proprietary models (GPT-3.5,
Claude 2.1) and four free-to-use open models, and compare their performance on
system logs obtained from a selection of mature open-source projects. We design
two different prompting approaches and apply the LLMs to 1,354 log templates
across 16 different projects. We evaluate their effectiveness in terms of the
number of correctly identified templates and the syntactic similarity between the
generated templates and the ground truth.
Results: We found that free-to-use models are able to compete with paid
models, with CodeLlama extracting 10% more log templates correctly than
GPT-3.5. Moreover, we provide qualitative insights into the usability of
language models (e.g., how easy it is to use their responses).
Conclusions: Our results reveal that some of the smaller, free-to-use LLMs
can considerably assist log parsing compared to their paid proprietary
competitors, especially code-specialized models.
comment: Accepted for publication in the 18th ACM/IEEE International Symposium
on Empirical Software Engineering and Measurement (ESEM '24)
☆ DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
Zhe Xu, Jiasheng Ye, Xiangyang Liu, Tianxiang Sun, Xiaoran Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu
With the rapid advancement of Large Language Models (LLMs), long-context
information understanding and processing have become a hot topic in academia
and industry. However, benchmarks for evaluating the ability of LLMs to handle
long-context information do not seem to have kept pace with the development of
LLMs. Despite the emergence of various long-context evaluation benchmarks, the
types of capability assessed are still limited, without new capability
dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning
benchmark featuring an average context length of over 100K tokens.
DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs,
which not only requires a full understanding of context but also requires
extracting important evidence from the context and reasoning over that
evidence to answer the given questions. This is a new dimension of
capability evaluation, which is more in line with the current intelligence
level of LLMs. We use detective novels as data sources, which naturally have
various reasoning elements. Finally, we manually annotated 600 questions in
Chinese and then also provided an English edition of the context information
and questions. We evaluate many long-context LLMs on DetectiveQA, including
commercial and open-sourced models, and the results indicate that existing
long-context LLMs still require significant advancements to effectively process
true long-context dependency questions.
☆ What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP 2024
This paper explores the pitfalls in evaluating multilingual automatic speech
recognition (ASR) models, with a particular focus on Indic language scripts. We
investigate the text normalization routine employed by leading ASR models,
including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer,
and their unintended consequences on performance metrics. Our research reveals
that current text normalization practices, while aiming to standardize ASR
outputs for fair comparison by removing inconsistencies such as variations in
spelling, punctuation, and special characters, are fundamentally flawed when
applied to Indic scripts. Through empirical analysis using text similarity
scores and in-depth linguistic examination, we demonstrate that these flaws
lead to artificially inflated performance metrics for Indic languages. We
conclude by proposing a shift towards developing normalization routines that
leverage native linguistic expertise, ensuring more robust and accurate
evaluations of multilingual ASR models.
comment: Submitted to EMNLP 2024
☆ Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions
demonstrates significant potential. However, achieving effective design and
improvement of reward functions in reinforcement learning (RL) tasks with
complex custom environments and multiple requirements presents considerable
challenges. In this paper, we enable LLMs to be effective white-box searchers,
highlighting their advanced semantic understanding capabilities. Specifically,
we generate reward components for each explicit user requirement and employ the
reward critic to identify the correct code form. Then, LLMs assign weights to
the reward components to balance their values and iteratively search and
optimize these weights based on the context provided by the training log
analyzer, while adaptively determining the search step size. We applied the
framework to an underwater information collection RL task without direct human
feedback or reward examples (zero-shot). The reward critic successfully
corrects the reward code with only one round of feedback per requirement,
effectively
preventing irreparable errors that can occur when reward function feedback is
provided in aggregate. The effective initialization of weights enables the
acquisition of different reward functions within the Pareto solution set
without weight search. Even in the case where a weight is 100 times off, fewer
than four iterations are needed to obtain solutions that meet user
requirements. The framework also works well with most prompts utilizing GPT-3.5
Turbo, since it does not require advanced numerical understanding or
calculation.
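The reward structure described above can be sketched as a weighted sum of
per-requirement components. This is a hypothetical illustration only: the
function names and the multiplicative rescaling rule are assumptions standing
in for the LLM-driven weight search, not the paper's code:

```python
def total_reward(state, components, weights):
    # One reward component per explicit user requirement, combined as a
    # weighted sum whose weights are searched and balanced.
    return sum(w * comp(state) for comp, w in zip(components, weights))

def rescale_weight(weights, idx, observed_scale, target_scale):
    # Toy stand-in for the training-log feedback loop: if a component's
    # observed magnitude is far from its intended share, rescale its
    # weight multiplicatively (an adaptive search step).
    new = list(weights)
    new[idx] *= target_scale / max(observed_scale, 1e-8)
    return new
```

Even a weight that starts orders of magnitude off can be brought into range in
a few multiplicative corrections of this kind, which matches the abstract's
observation that fewer than four iterations suffice.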
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as
opposed to extractive techniques, this survey presents a comprehensive
overview, delving into state-of-the-art techniques, prevailing challenges, and
prospective research directions. We categorize the techniques into traditional
sequence-to-sequence models, pre-trained large language models, reinforcement
learning, hierarchical methods, and multi-modal summarization. Unlike prior
works that did not examine complexities, scalability and comparisons of
techniques in detail, this review takes a comprehensive approach encompassing
state-of-the-art methods, challenges, solutions, comparisons, limitations and
charts out future improvements - providing researchers an extensive overview to
advance abstractive summarization research. We provide vital comparison tables
across techniques categorized - offering insights into model complexity,
scalability and appropriate applications. The paper highlights challenges such
as inadequate meaning representation, factual consistency, controllable text
summarization, cross-lingual summarization, and evaluation metrics, among
others. Solutions leveraging knowledge incorporation and other innovative
strategies are proposed to address these challenges. The paper concludes by
highlighting emerging research areas like factual inconsistency,
domain-specific, cross-lingual, multilingual, and long-document summarization,
as well as handling noisy data. Our objective is to provide researchers and
practitioners with a structured overview of the domain, enabling them to better
understand the current landscape and identify potential areas for further
research and improvement.
comment: 9 Tables, 7 Figures
☆ Determination of language families using deep learning
We use a c-GAN (convolutional generative adversarial) neural network to
analyze transliterated text fragments of extant languages, dead but
comprehensible languages, and one dead, undeciphered language (Cypro-Minoan) to
establish linguistic affinities.
The paper is agnostic with respect to translation and/or deciphering. However,
there is hope that the proposed approach can be useful for decipherment with
more sophisticated neural network techniques.
comment: First draft. Comments are welcome
☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models
(LLMs) and cognitive science, examining similarities and differences between
LLMs and human cognitive processes. We analyze methods for evaluating LLMs'
cognitive abilities and discuss their potential as cognitive models. The review
covers applications of LLMs in various cognitive fields, highlighting insights
gained for cognitive science research. We assess cognitive biases and
limitations of LLMs, along with proposed methods for improving their
performance. The integration of LLMs with cognitive architectures is examined,
revealing promising avenues for enhancing artificial intelligence (AI)
capabilities. Key challenges and future research directions are identified,
emphasizing the need for continued refinement of LLMs to better align with
human cognition. This review provides a balanced perspective on the current
state and future potential of LLMs in advancing our understanding of both
artificial and human intelligence.
comment: 10 pages, 1 figure
☆ STAB: Speech Tokenizer Assessment Benchmark
Shikhar Vashishth, Harman Singh, Shikhar Bharadwaj, Sriram Ganapathy, Chulayuth Asawaroengchai, Kartik Audhkhasi, Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran
Representing speech as discrete tokens provides a framework for transforming
speech into a format that closely resembles text, thus enabling the use of
speech as an input to the widely successful large language models (LLMs).
Currently, while several speech tokenizers have been proposed, there is
ambiguity regarding the properties that are desired from a tokenizer for
specific downstream tasks and its overall generalizability. Evaluating the
performance of tokenizers across different downstream tasks is a
computationally intensive effort that poses challenges for scalability. To
circumvent this requirement, we present STAB (Speech Tokenizer Assessment
Benchmark), a systematic evaluation framework designed to assess speech
tokenizers comprehensively and shed light on their inherent characteristics.
This framework provides a deeper understanding of the underlying mechanisms of
speech tokenization, thereby offering a valuable resource for expediting the
advancement of future tokenizer models and enabling comparative analysis using
a standardized benchmark. We evaluate the STAB metrics and correlate them with
downstream task performance across a range of speech tasks and tokenizer
choices.
comment: 5 pages
☆ How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review
Xichou Zhu, Yang Liu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Bolong Yang, Manman Wang, Zongxing Xie, Peng Liu, Dan Cai, Junhui Wang
The recent advances in large language models (LLMs) have significantly
expanded their applications across various fields such as language generation,
summarization, and complex question answering. However, their application to
privacy compliance and technical privacy reviews remains under-explored,
raising critical concerns about their ability to adhere to global privacy
standards and protect sensitive user data. This paper seeks to address this gap
by providing a comprehensive case study evaluating LLMs' performance in
privacy-related tasks such as privacy information extraction (PIE), legal and
regulatory key point detection (KPD), and question answering (QA) with respect
to privacy policies and data protection regulations. We introduce a Privacy
Technical Review (PTR) framework, highlighting its role in mitigating privacy
risks during the software development life-cycle. Through an empirical
assessment, we investigate the capacity of several prominent LLMs, including
BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks
and technical privacy reviews. Our experiments benchmark the models across
multiple dimensions, focusing on their precision, recall, and F1-scores in
extracting privacy-sensitive information and detecting key regulatory
compliance points. While LLMs show promise in automating privacy reviews and
identifying regulatory discrepancies, significant gaps persist in their ability
to fully comply with evolving legal standards. We provide actionable
recommendations for enhancing LLMs' capabilities in privacy compliance,
emphasizing the need for robust model improvements and better integration with
legal and regulatory requirements. This study underscores the growing
importance of developing privacy-aware LLMs that can both support businesses in
compliance efforts and safeguard user privacy rights.
comment: 8 pages, 4 figures
☆ Do Large Language Models Possess Sensitivity to Sentiment?
Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhiyang Xu, Wei Luo, Junhui Wang
Large Language Models (LLMs) have recently displayed their extraordinary
capabilities in language understanding. However, how to comprehensively assess
the sentiment capabilities of LLMs continues to be a challenge. This paper
investigates the ability of LLMs to detect and react to sentiment in text
modality. As the integration of LLMs into diverse applications is on the rise, it
becomes highly critical to comprehend their sensitivity to emotional tone, as
it can influence the user experience and the efficacy of sentiment-driven
tasks. We conduct a series of experiments to evaluate the performance of
several prominent LLMs in identifying and responding appropriately to
sentiments like positive, negative, and neutral emotions. The models' outputs
are analyzed across various sentiment benchmarks, and their responses are
compared with human evaluations. Our discoveries indicate that although LLMs
show a basic sensitivity to sentiment, there are substantial variations in
their accuracy and consistency, emphasizing the requirement for further
enhancements in their training processes to better capture subtle emotional
cues. For example, in some cases the models might wrongly
classify a strongly positive sentiment as neutral, or fail to recognize sarcasm
or irony in the text. Such misclassifications highlight the complexity of
sentiment analysis and the areas where the models need to be refined. Another
aspect is that different LLMs might perform differently on the same set of
data, depending on their architecture and training datasets. This variance
calls for a more in-depth study of the factors that contribute to the
performance differences and how they can be optimized.
comment: 10 pages, 2 figures
☆ Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
The retrieval-augmented generation (RAG) framework addresses ambiguity in
user queries in QA systems by retrieving passages that cover all plausible
interpretations and generating comprehensive responses based on the passages.
However, our preliminary studies reveal that a single retrieval process often
suffers from low-quality results, as the retrieved passages frequently fail to
capture all plausible interpretations. Although the iterative RAG approach has
been proposed to address this problem, it comes at the cost of significantly
reduced efficiency. To address these issues, we propose the
diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved
passages to encompass diverse interpretations. Subsequently, DIVA verifies the
quality of the passages and adapts the most suitable approach tailored to their
quality. This approach improves QA system accuracy and robustness by handling
the low-quality retrieval issue in ambiguous questions while enhancing
efficiency.
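The control flow described above can be sketched as follows. This is a
hypothetical illustration; all callables here are stand-ins, not DIVA's actual
interfaces:

```python
def diva_answer(question, retrieve, verify, direct_qa, iterative_qa):
    # Diversify-verify-adapt sketch: retrieve a diversified passage set
    # once, verify its quality, and fall back to the expensive
    # iterative pipeline only when verification fails.
    passages = retrieve(question, diversify=True)
    if verify(question, passages):
        return direct_qa(question, passages)  # fast path: one retrieval
    return iterative_qa(question)             # slow path: iterative RAG
```

The efficiency gain comes from the branch: most questions take the single-pass
fast path, and only verified-low-quality retrievals pay the iterative cost.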
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval)
from pre-trained embedding models is the predominant retrieval method for text
and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In
practice, application developers often fine-tune the embeddings to improve
their accuracy on the dataset and query workload at hand. Existing approaches
either fine-tune the pre-trained model itself or, more efficiently but at the
cost of accuracy, train adaptor models to transform the output of the
pre-trained model. We present NUDGE, a family of novel non-parametric embedding
fine-tuning approaches that are significantly more accurate and efficient than
both sets of existing approaches. NUDGE directly modifies the embeddings of
data records to maximize the accuracy of $k$-NN retrieval. We present a
thorough theoretical and experimental study of NUDGE's non-parametric approach.
We show that even though the underlying problem is NP-Hard, constrained
variations can be solved efficiently. These constraints additionally ensure
that the changes to the embeddings are modest, avoiding large distortions to
the semantics learned during pre-training. In experiments across five
pre-trained models and nine standard text and image retrieval datasets, NUDGE
runs in minutes and often improves NDCG@10 by more than 10% over existing
fine-tuning methods. On average, NUDGE yields 3.3x and 4.3x larger gains in
accuracy and runs 200x and 3x faster, respectively, than fine-tuning the
pre-trained model and training adaptors.
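The core idea -- directly moving record embeddings toward training queries
under a modest-change constraint -- can be sketched in plain Python. This is a
toy illustration of the non-parametric principle, not NUDGE's actual
constrained-optimization algorithm:

```python
import math

def nudge(record_embs, query, pos_idx, max_delta=0.2):
    # Move the ground-truth record's embedding toward its training
    # query, with the change bounded so semantics learned during
    # pre-training are not distorted, then re-normalize to unit length.
    rec = record_embs[pos_idx]
    delta = [q - r for q, r in zip(query, rec)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm > max_delta:                       # bound the modification
        delta = [d * max_delta / norm for d in delta]
    moved = [r + d for r, d in zip(rec, delta)]
    length = math.sqrt(sum(m * m for m in moved))
    record_embs[pos_idx] = [m / length for m in moved]
    return record_embs

def cos(a, b):
    # Cosine similarity for unit vectors.
    return sum(x * y for x, y in zip(a, b))
```

Because only the data-record embeddings change (no model weights), the update
is cheap; the `max_delta` bound plays the role of the constraints that keep
the edited embeddings close to the pre-trained ones.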
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery associates structured patterns with model errors.
Existing methods discover error slices by clustering the error-prone samples
with similar patterns or assigning discrete attributes to each sample for
post-hoc analysis. While these methods aim for interpretability and easier
mitigation through reweighting or rebalancing, they may not capture the full
complexity of error patterns due to incomplete or missing attributes. Contrary
to the existing approach, this paper utilizes the reasoning capabilities of the
Large Language Model (LLM) to analyze complex error patterns and generate
testable hypotheses. This paper proposes LADDER: Language Driven slice
Discovery and Error Rectification. It first projects the model's representation
into a language-aligned feature space (e.g., CLIP) to preserve semantics in the
original model feature space. This ensures the accurate retrieval of sentences
that highlight the model's errors. Next, the LLM utilizes the sentences and
generates hypotheses to discover error slices. Finally, we mitigate the error
by fine-tuning the classification head by creating a group-balanced dataset
using the hypotheses. Our entire method does not require any attribute
annotation, either explicitly or through external tagging models. We validate
our method on five image classification datasets. The code is
available (https://github.com/batmanlab/Ladder).
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Joe B Hakim, Jeffery L Painter, Darmendra Ramcharran, Vijay Kara, Greg Powell, Paulina Sobczak, Chiho Sato, Andrew Bate, Andrew Beam
Large language models (LLMs) are useful tools with the capacity for
performing specific types of knowledge work at an effective scale. However, LLM
deployments in high-risk and safety-critical domains pose unique challenges,
notably the issue of "hallucination," where LLMs can generate fabricated
information. This is particularly concerning in settings such as drug safety,
where inaccuracies could lead to patient harm. To mitigate these risks, we have
developed and demonstrated a proof of concept suite of guardrails specifically
designed to mitigate certain types of hallucinations and errors for drug
safety, and potentially applicable to other medical safety-critical contexts.
These guardrails include mechanisms to detect anomalous documents to prevent
the ingestion of inappropriate data, identify incorrect drug names or adverse
event terms, and convey uncertainty in generated content. We integrated these
guardrails with an LLM fine-tuned for a text-to-text task, which involves
converting both structured and unstructured data within adverse event reports
into natural language. This method was applied to translate individual case
safety reports, demonstrating effective application in a pharmacovigilance
processing task. Our guardrail framework offers a set of tools with broad
applicability across various domains, ensuring LLMs can be safely used in
high-risk situations by eliminating the occurrence of key errors, including the
generation of incorrect pharmacovigilance-related terms, thus adhering to
stringent regulatory and quality standards in medical safety-critical
environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
Large language models (LLMs) are routinely pre-trained on billions of tokens,
only to start the process over again once new data becomes available. A much
more efficient solution is to continually pre-train these models, saving
significant compute compared to re-training. However, the distribution shift
induced by new data typically results in degraded performance on previous data
or poor adaptation to the new data. In this work, we show that a simple and
scalable combination of learning rate (LR) re-warming, LR re-decaying, and
replay of previous data is sufficient to match the performance of fully
re-training from scratch on all available data, as measured by the final loss
and the average score on several language model (LM) evaluation benchmarks.
Specifically, we show this for a weak but realistic distribution shift between
two commonly used LLM pre-training datasets (English$\rightarrow$English) and a
stronger distribution shift (English$\rightarrow$German) at the $405$M
parameter model scale with large dataset sizes (hundreds of billions of
tokens). Selecting the weak but realistic shift for larger-scale experiments,
we also find that our continual learning strategies match the re-training
baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be
successfully updated via simple and scalable continual learning strategies,
matching the re-training baseline using only a fraction of the compute.
Finally, inspired by previous work, we propose alternatives to the cosine
learning rate schedule that help circumvent forgetting induced by LR re-warming
and that are not bound to a fixed token budget.
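The recipe's two ingredients -- LR re-warming with re-decaying, and replay of
previous data -- can be sketched as follows. This is a simplified illustration
with assumed hyperparameter names, not the authors' training code:

```python
import math
import random

def rewarmed_cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    # When a new dataset arrives, linearly re-warm the LR from min_lr
    # back to max_lr, then cosine-decay it again over the new data.
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def replay_batch(new_batch, old_data, replay_frac, rng):
    # Mix a fraction of previous-dataset examples into each batch of
    # new data to counter the distribution shift.
    k = int(len(new_batch) * replay_frac)
    return new_batch[: len(new_batch) - k] + rng.sample(old_data, k)
```

The schedule restarts at `min_lr`, peaks at `max_lr` at the end of re-warming,
and returns to `min_lr` by the end of the new dataset, while replay keeps a
slice of old data in every batch.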
♻ ☆ LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Yan Wang, Wei Shen, Qing Gu, Anh Tuan Luu, See-Kiong Ng, Zhiwei Jiang, Bryan Hooi
Large language models (LLMs) face significant challenges in handling
long-context tasks because of their limited effective context window size
during pretraining, which restricts their ability to generalize over extended
sequences. Meanwhile, extending the context window in LLMs through
post-pretraining is highly resource-intensive. To address this, we introduce
LongRecipe, an efficient training strategy for extending the context window of
LLMs, including impactful token analysis, position index transformation, and
training optimization strategies. It simulates long-sequence inputs while
maintaining training efficiency and significantly improves the model's
understanding of long-range dependencies. Experiments on three types of LLMs
show that LongRecipe can utilize long sequences while requiring only 30% of the
target context window size, and reduces computational training resources by
over 85% compared to full-sequence training. Furthermore, LongRecipe preserves
the original LLM's capabilities in general tasks. Ultimately, we can extend the
effective context window of open-source LLMs from 8k to 128k, achieving
performance close to GPT-4 with just one day of dedicated training using a
single GPU with 80GB of memory. Our code is released at
https://github.com/zhiyuanhubj/LongRecipe.
comment: Work in Progress
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations at the
character or token level. Token-level attacks, gaining prominence for their
use of gradient-based methods, are susceptible to altering sentence semantics,
leading to invalid adversarial examples. While character-level attacks easily
maintain semantics, they have received less attention as they cannot easily
adopt popular gradient-based methods, and are thought to be easy to defend.
Challenging these beliefs, we introduce Charmer, an efficient query-based
adversarial attack capable of achieving high attack success rate (ASR) while
generating highly similar adversarial examples. Our method successfully targets
both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2,
Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8
percentage points over the prior state of the art. Our implementation is available at
https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
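The query-based search described above can be illustrated with a minimal greedy loop. The black-box `loss_fn`, the substitution-only edit space, and the fixed edit budget are simplifying assumptions for this sketch, not Charmer's exact algorithm:

```python
import string

def char_substitutions(sentence):
    """Yield all single-character substitution variants of a sentence."""
    for i in range(len(sentence)):
        for c in string.ascii_lowercase:
            if c != sentence[i]:
                yield sentence[:i] + c + sentence[i + 1:]

def greedy_char_attack(sentence, loss_fn, max_edits=3):
    """Greedily apply the single-character edit that most increases loss_fn,
    stopping when no edit improves it or the budget is spent."""
    current = sentence
    for _ in range(max_edits):
        best = max(char_substitutions(current), key=loss_fn)
        if loss_fn(best) <= loss_fn(current):
            break
        current = best
    return current
```

With a real model, `loss_fn` would query the classifier's loss on the perturbed sentence; the small edit budget is what keeps the adversarial example highly similar to the original.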
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Large Language Models (LLMs) have demonstrated notable capabilities across
various tasks, showcasing complex problem-solving abilities. Understanding and
executing complex rules, along with multi-step planning, are fundamental to
logical reasoning and critical for practical LLM agents and decision-making
systems. However, evaluating LLMs as effective rule-based executors and
planners remains underexplored. In this paper, we introduce LogicGame, a novel
benchmark designed to evaluate the comprehensive rule understanding, execution,
and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame
provides diverse games that contain a series of rules with an initial state,
requiring models to comprehend and apply predefined regulations to solve
problems. We create simulated scenarios in which models execute or plan
operations to achieve specific outcomes. These game scenarios are specifically
designed to distinguish logical reasoning from mere knowledge by relying
exclusively on predefined rules. This separation allows for a pure assessment
of rule-based reasoning capabilities. The evaluation considers not only final
outcomes but also intermediate steps, providing a comprehensive assessment of
model performance. Moreover, these intermediate steps are deterministic and can
be automatically verified. LogicGame defines game scenarios with varying
difficulty levels, from simple rule applications to complex reasoning chains,
in order to offer a precise evaluation of model performance on rule
understanding and multi-step execution. Utilizing LogicGame, we test various
LLMs and identify notable shortcomings in their rule-based logical reasoning
abilities.
♻ ☆ AI-generated text boundary detection with RoFT
Laida Kushnareva, Tatiana Gaintseva, German Magai, Serguei Barannikov, Dmitry Abulkhanov, Kristian Kuznetsov, Eduard Tulchinskii, Irina Piontkovskaya, Sergey Nikolenko
Due to the rapid development of large language models, people increasingly
often encounter texts that may start as written by a human but continue as
machine-generated. Detecting the boundary between human-written and
machine-generated parts of such texts is a challenging problem that has not
received much attention in the literature. We attempt to bridge this gap and
examine several ways to adapt state-of-the-art artificial text detection
classifiers to the boundary detection setting. We push all detectors to their
limits, using the Real or Fake text benchmark that contains short texts on
several topics and includes generations of various language models. We use this
diversity to deeply examine the robustness of all detectors in cross-domain and
cross-model settings to provide baselines and insights for future research. In
particular, we find that perplexity-based approaches to boundary detection tend
to be more robust to peculiarities of domain-specific data than supervised
fine-tuning of the RoBERTa model; we also find which features of the text
confuse boundary detection algorithms and negatively influence their
performance in cross-domain settings.
comment: Our official repository:
https://github.com/SilverSolver/ai_boundary_detection
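The perplexity-based approach the authors find robust can be sketched as a simple changepoint search over per-token negative log-likelihoods. The single-split heuristic, and the assumption that the NLLs were precomputed by some language model, are illustrative simplifications rather than the paper's exact detectors:

```python
def detect_boundary(token_nlls):
    """Pick the split point that maximizes the gap between the mean
    per-token NLL before and after it (a one-changepoint heuristic).
    Human-written text tends to score higher NLL under an LM than
    the LM-generated continuation, so the gap peaks at the boundary."""
    n = len(token_nlls)
    best_idx, best_gap = None, float("-inf")
    for i in range(1, n):
        left = sum(token_nlls[:i]) / i
        right = sum(token_nlls[i:]) / (n - i)
        gap = left - right
        if gap > best_gap:
            best_gap, best_idx = gap, i
    return best_idx
```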
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive
technology. They have been shown to excel in tasks ranging from poem writing
and coding to essay generation and puzzle solving. With the incorporation of
image generation capability, they have become more comprehensive and versatile
AI tools. At the same time, researchers are striving to identify the
limitations of these tools to improve them further. Currently identified flaws
include hallucination, biases, and bypassing restricted commands to generate
harmful content. In the present work, we have identified a fundamental
limitation related to the image generation ability of LLMs, and termed it The
NO Syndrome. This negation blindness refers to LLMs' inability to correctly
comprehend NO-related natural language prompts when generating the desired
images. Interestingly, all tested LLMs, including GPT-4, Gemini, and Copilot,
were found to suffer from this syndrome. To demonstrate the generality of this
limitation, we carried out simulation experiments and conducted entropy-based
and benchmark statistical analysis tests on various LLMs in multiple languages,
including English, Hindi, and French. We conclude that the NO syndrome is a
significant flaw in current LLMs that needs to be addressed. A related finding
of this study showed a consistent discrepancy between image and textual
responses as a result of this NO syndrome. We posit that the introduction of a
negation-aware, reinforcement-learning-based feedback loop between the
LLM's textual response and generated image could help ensure the generated text
reflects both the LLM's correct contextual understanding of the negation
query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Large language models (LLMs) are trained on broad corpora and then used in
communities with specialized norms. Is providing LLMs with community rules
enough for models to follow these norms? We evaluate LLMs' capacity to detect
(Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's
Neutral Point of View (NPOV) policy. LLMs struggled with bias detection,
achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting
biases (some under- and others over-predicted bias), suggesting distinct priors
about neutrality. LLMs performed better at generation, removing 79% of words
removed by Wikipedia editors. However, LLMs made additional changes beyond
Wikipedia editors' simpler neutralizations, resulting in high-recall but
low-precision editing. Interestingly, crowdworkers rated AI rewrites as more
neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative
analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia
editors but often made extraneous non-NPOV-related changes (such as grammar).
LLMs may apply rules in ways that resonate with the public but diverge from
community experts. While potentially effective for generation, LLMs may reduce
editor agency and increase moderation workload (e.g., verifying additions).
Even when rules are easy to articulate, having LLMs apply them like community
members may still be difficult.
♻ ☆ A Causal Explainable Guardrails for Large Language Models
Large Language Models (LLMs) have shown impressive performance in natural
language tasks, but their outputs can exhibit undesirable attributes or biases.
Existing methods for steering LLMs toward desired attributes often assume
unbiased representations and rely solely on steering prompts. However, the
representations learned from pre-training can introduce semantic biases that
influence the steering process, leading to suboptimal results. We propose
LLMGuardrail, a novel framework that incorporates causal analysis and
adversarial learning to obtain unbiased steering representations in LLMs.
LLMGuardrail systematically identifies and blocks the confounding effects of
biases, enabling the extraction of unbiased steering representations.
Additionally, it includes an explainable component that provides insights into
the alignment between the generated output and the desired direction.
Experiments demonstrate LLMGuardrail's effectiveness in steering LLMs toward
desired attributes while mitigating biases. Our work contributes to the
development of safe and reliable LLMs that align with desired attributes.
comment: 16 pages
♻ ☆ Parallel Speculative Decoding with Adaptive Draft Length
Speculative decoding (SD), where an extra draft model is employed to provide
multiple \textit{draft} tokens first and then the original target model
verifies these tokens in parallel, has shown great power for LLM inference
acceleration. However, existing SD methods suffer from the mutual waiting
problem, i.e., the target model gets stuck when the draft model is
\textit{guessing} tokens, and vice versa. This problem is directly incurred by
the asynchronous execution of the draft model and the target model, and is
exacerbated due to the fixed draft length in speculative decoding. To address
these challenges, we propose a conceptually simple, flexible, and general
framework to boost speculative decoding, namely \textbf{P}arallel
sp\textbf{E}culative decoding with \textbf{A}daptive d\textbf{R}aft
\textbf{L}ength (PEARL). Specifically, PEARL proposes \textit{pre-verify} to
verify the first draft token in advance during the drafting phase, and
\textit{post-verify} to generate more draft tokens during the verification
phase. PEARL parallels the drafting phase and the verification phase via
applying the two strategies, and achieves adaptive draft length for different
scenarios, which effectively alleviates the mutual waiting problem. Moreover,
we theoretically demonstrate that the mean number of accepted tokens of PEARL
exceeds that of existing \textit{draft-then-verify} works. Experiments on
various text generation benchmarks demonstrate the effectiveness of PEARL,
yielding speedups of up to \textbf{3.79$\times$} and \textbf{1.52$\times$}
over auto-regressive decoding and vanilla speculative decoding, respectively.
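For reference, the vanilla draft-then-verify loop that PEARL parallelizes can be sketched with toy deterministic "models". The callables below stand in for real draft/target LMs, and the greedy accept-or-correct rule is a simplification of the probabilistic acceptance used in practice:

```python
def speculative_decode(target_next, draft_next, prompt, draft_len=4, max_tokens=12):
    """Vanilla draft-then-verify: the draft model proposes `draft_len` tokens,
    the target model keeps the longest agreeing prefix plus its own correction.
    `target_next` / `draft_next` are toy stand-ins mapping a sequence to the
    next token. Note the phases run strictly in turn here; this serialization
    is exactly the mutual waiting problem PEARL addresses."""
    seq = list(prompt)
    while len(seq) < max_tokens:
        # Drafting phase: the draft model guesses a block of tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(seq + draft))
        # Verification phase: the target model accepts the matching prefix.
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction; stop here
                break
        seq.extend(accepted)
    return seq[:max_tokens]
```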
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language
processing by dynamically integrating external knowledge into Large Language
Models (LLMs), addressing their limitation of static training datasets. Recent
implementations of RAG leverage hierarchical data structures, which organize
documents at various levels of summarization and information density. This
complexity, however, can cause LLMs to "choke" on information overload,
necessitating more sophisticated querying mechanisms. In this context, we
introduce Hierarchical Information Retrieval Optimization (HIRO), a novel
querying approach that employs a Depth-First Search (DFS)-based recursive
similarity score calculation and branch pruning. This method minimizes the
context delivered to the LLM without information loss, effectively
managing the challenge of excessive data. HIRO's refined approach is validated
by a 10.85% improvement in performance on the NarrativeQA dataset.
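The DFS-with-pruning traversal described above can be sketched over a toy summary tree. The dict-based node layout, the flat threshold, and the `query_sim` scorer are assumptions for this illustration, not HIRO's actual data structures:

```python
def hiro_retrieve(node, query_sim, threshold=0.5):
    """Depth-first search over a hierarchical summary tree, pruning any
    branch whose similarity to the query falls below `threshold`; the
    surviving leaf texts form the context passed to the LLM."""
    if query_sim(node["text"]) < threshold:
        return []  # prune this branch entirely
    if not node.get("children"):
        return [node["text"]]
    results = []
    for child in node["children"]:
        results.extend(hiro_retrieve(child, query_sim, threshold))
    return results
```

Because pruning happens at the summary level, whole subtrees of irrelevant documents are skipped without scoring their leaves individually.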
♻ ☆ What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some
researchers have investigated theoretically what problems they can and cannot
solve, by treating problems as formal languages. Exploring such questions can
help clarify the power of transformers relative to other models of computation,
their fundamental capabilities and limits, and the impact of architectural
choices. Work in this subarea has made considerable progress in recent years.
Here, we undertake a comprehensive survey of this work, documenting the diverse
assumptions that underlie different results and providing a unified framework
for harmonizing seemingly contradictory findings.
comment: One minor correction in {\S}5.1
♻ ☆ Large Language Models for Information Retrieval: A Survey
Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, Ji-Rong Wen
As a primary means of information acquisition, information retrieval (IR)
systems, such as search engines, have integrated themselves into our daily
lives. These systems also serve as components of dialogue, question-answering,
and recommender systems. The trajectory of IR has evolved dynamically from its
origins in term-based methods to its integration with advanced neural models.
While the neural models excel at capturing complex contextual signals and
semantic nuances, thereby reshaping the IR landscape, they still face
challenges such as data scarcity, interpretability, and the generation of
contextually plausible yet potentially inaccurate responses. This evolution
requires a combination of both traditional methods (such as term-based sparse
retrieval methods with rapid response) and modern neural architectures (such as
language models with powerful language understanding capacity). Meanwhile, the
emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has
revolutionized natural language processing due to their remarkable language
understanding, generation, generalization, and reasoning abilities.
Consequently, recent research has sought to leverage LLMs to improve IR
systems. Given the rapid evolution of this research trajectory, it is necessary
to consolidate existing methodologies and provide nuanced insights through a
comprehensive overview. In this survey, we delve into the confluence of LLMs
and IR systems, including crucial aspects such as query rewriters, retrievers,
rerankers, and readers. Additionally, we explore promising directions, such as
search agents, within this expanding field.
comment: updated to version 3
♻ ☆ Towards a Universal Method for Meaningful Signal Detection
It is known that human speech and certain animal vocalizations can convey
meaningful content because we can decipher the content that a given utterance
does convey. This paper explores an alternative approach to determining whether
a signal is meaningful, one that analyzes only the signal itself and is
independent of what the conveyed meaning might be. We devise a method that
takes a waveform as input and outputs a score indicating its degree of
'meaningfulness'. We cluster contiguous portions of the input to minimize the
total description length, and then take the code length of the assigned
cluster labels as the meaningfulness score. We evaluate our method empirically,
against several baselines, and show that it is the only one to give a high
score to human speech in various languages and with various speakers, a
moderate score to animal vocalizations from birds and orcas, and a low score to
ambient noise from various sources.
♻ ☆ Open Implementation and Study of BEST-RQ for Speech Processing ICASSP 2024
Self-Supervised Learning (SSL) has proven to be useful in various speech
tasks. However, these methods are generally very demanding in terms of data,
memory, and computational resources. BERT-based Speech pre-Training with
Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great
performance on Automatic Speech Recognition (ASR) while being simpler than
other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance,
details are lacking in the original paper, such as the amount of GPU/TPU hours
used in pre-training, and there is no official easy-to-use open-source
implementation. Furthermore, BEST-RQ has not been evaluated on other downstream
tasks aside from ASR and speech translation. In this work, we describe a
re-implementation of a Random-projection quantizer and perform a preliminary
study with a comparison to wav2vec 2.0 on four downstream tasks. We discuss the
details and differences of our implementation. We show that a random projection
quantizer can achieve similar downstream performance as wav2vec 2.0 while
decreasing training time by over a factor of two.
comment: Accepted in IEEE ICASSP 2024 workshop on Self-supervision in Audio,
Speech and Beyond (SASB 2024)
♻ ☆ Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Large language models (LLMs) have triggered a new stream of research focusing
on compressing the context length to reduce the computational cost while
ensuring the retention of helpful information for LLMs to answer the given
question. Token-based removal methods are one of the most prominent approaches
in this direction, but risk losing the semantics of the context caused by
intermediate token removal, especially under high compression ratios, while
also facing challenges in computational efficiency. In this work, we propose
context-aware prompt compression (CPC), a sentence-level prompt compression
technique where its key innovation is a novel context-aware sentence encoder
that provides a relevance score for each sentence for a given question. To
train this encoder, we generate a new dataset consisting of questions,
positives, and negative pairs where positives are sentences relevant to the
question, while negatives are irrelevant context sentences. We train the
encoder in a contrastive setup to learn context-aware sentence representations.
Our method considerably outperforms prior works on prompt compression on
benchmark datasets and is up to 10.93x faster at inference compared to the best
token-level compression method. We also find better improvement for shorter
length constraints in most benchmarks, showing the effectiveness of our
proposed solution in the compression of relevant information in a shorter
context. Finally, we release the code and the dataset for quick reproducibility
and further development: https://github.com/Workday/cpc.
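The sentence-level selection step can be sketched as follows; the `relevance` callable stands in for the paper's context-aware sentence encoder, and the top-k-with-original-order policy is an assumption for illustration:

```python
def compress_prompt(sentences, relevance, keep=2):
    """Sentence-level prompt compression: score each sentence for the given
    question, keep the `keep` highest-scoring sentences, and preserve their
    original order so the compressed context stays coherent."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: relevance(sentences[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore document order
    return " ".join(sentences[i] for i in kept)
```

Because whole sentences are dropped rather than individual tokens, the surviving context remains grammatical, which is the semantic-preservation advantage claimed over token-level removal.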
♻ ☆ CADGE: Context-Aware Dialogue Generation Enhanced with Graph-Structured Knowledge Aggregation
Commonsense knowledge is crucial to many natural language processing tasks.
Existing works usually incorporate graph knowledge with conventional graph
neural networks (GNNs), leading to the text and graph knowledge encoding
processes being separated in a serial pipeline. We argue that these separate
representation learning stages may be suboptimal for neural networks to learn
the overall context contained in both types of input knowledge. In this paper,
we propose a novel context-aware graph-attention model (Context-aware GAT),
which can effectively incorporate global features of relevant knowledge graphs
based on a context-enhanced knowledge aggregation process. Specifically, our
framework leverages a novel representation learning approach to process
heterogeneous features - combining flattened graph knowledge with text. To the
best of our knowledge, this is the first attempt at hierarchically applying
graph knowledge aggregation on a connected subgraph in addition to contextual
information to support commonsense dialogue generation. This framework shows
superior performance compared to conventional GNN-based language frameworks.
Both automatic and human evaluations demonstrate that our proposed model has
significant performance uplifts over state-of-the-art baselines.
comment: Accepted by INLG 2024
♻ ☆ Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and
insertion issues. The Sindhi script itself adds to this complexity: it is
cursive and consists of characters with inherent joining and non-joining
properties, independent of word boundaries. Existing Sindhi word segmentation
methods rely on designing and combining hand-crafted features. However, these
methods have limitations, such as difficulty handling out-of-vocabulary words,
limited robustness for other languages, and inefficiency with large amounts of
noisy or raw text. Neural network-based models, in contrast, can automatically
capture word boundary information without requiring prior knowledge. In this
paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses
word segmentation as a sequence labeling task. The SGNWS model incorporates
subword representation learning through a bidirectional long short-term memory
encoder, position-aware self-attention, and a conditional random field. Our
empirical results demonstrate that the SGNWS model achieves state-of-the-art
performance in Sindhi word segmentation on six datasets.
comment: Journal Paper, 14 pages
♻ ☆ A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
Modern Artificial Intelligence applications show great potential for
language-related tasks that rely on next-word prediction. The current
generation of Large Language Models (LLMs) have been linked to claims about
human-like linguistic performance and their applications are hailed both as a
step towards artificial general intelligence and as a major advance in
understanding the cognitive, and even neural basis of human language. To assess
these claims, first we analyze the contribution of LLMs as theoretically
informative representations of a target cognitive system vs. atheoretical
mechanistic tools. Second, we evaluate the models' ability to see the bigger
picture, through top-down feedback from higher levels of processing, which
requires grounding in previous expectations and past world experience. We
hypothesize that since models lack grounded cognition, they cannot take
advantage of these features and instead solely rely on fixed associations
between represented words and word vectors. To assess this, we designed and ran
a novel 'leet task' (l33t t4sk), which requires decoding sentences in which
letters are systematically replaced by numbers. The results suggest that humans
excel in this task whereas models struggle, confirming our hypothesis. We
interpret the results by identifying the key abilities that are still missing
from the current state of development of these models, which require solutions
that go beyond increased system scaling.
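The substitution underlying the l33t task can be made concrete with a fixed digit-to-letter map; the particular mapping below is a common leet convention assumed for illustration, not necessarily the paper's exact cipher:

```python
# Common leet digit-to-letter substitutions (assumed mapping).
LEET_MAP = {"4": "a", "3": "e", "1": "l", "0": "o", "7": "t", "5": "s"}

def decode_leet(text):
    """Decode a leet-obfuscated sentence by mapping digits back to letters;
    humans do this via top-down expectations, which is what the task probes."""
    return "".join(LEET_MAP.get(ch, ch) for ch in text)
```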
♻ ☆ Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test LREC
Independent Component Analysis (ICA) is an algorithm originally developed for
finding separate sources in a mixed signal, such as a recording of multiple
people in the same room speaking at the same time. Unlike Principal Component
Analysis (PCA), ICA permits the representation of a word as an unstructured set
of features, without any particular feature being deemed more significant than
the others. In this paper, we used ICA to analyze word embeddings. We have
found that ICA can be used to find semantic features of the words, and these
features can easily be combined to search for words that satisfy the
combination. We show that most of the independent components represent such
features. To quantify the interpretability of the components, we use the word
intruder test, performed both by humans and by large language models. We
propose to use the automated version of the word intruder test as a fast and
inexpensive way of quantifying vector interpretability without the need for
human effort.
comment: Presented at LREC-COLING 2024, cite this version please:
https://aclanthology.org/2024.lrec-main.605/
♻ ☆ Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Seyed Amir Ahmad Safavi-Naini, Shuhaib Ali, Omer Shahab, Zahra Shahhoseini, Thomas Savage, Sara Rafiee, Jamil S Samaan, Reem Al Shabeeb, Farah Ladak, Jamie O Yang, Juan Echavarria, Sumbal Babar, Aasma Shaukat, Samuel Margolis, Nicholas P Tatonetti, Girish Nadkarni, Bara El Kurdi, Ali Soroush
Background and Aims: This study evaluates the medical reasoning performance
of large language models (LLMs) and vision language models (VLMs) in
gastroenterology.
Methods: We used 300 gastroenterology board exam-style multiple-choice
questions, 138 of which contain images, to systematically assess the impact of
model configurations, parameters, and prompt engineering strategies using
GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs
(versions), including GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0),
Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces
(web and API), computing environments (cloud and local), and model precisions
(with and without quantization). Finally, we assessed accuracy using a
semiautomated pipeline.
Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet
(74.0%) achieved the highest accuracy, outperforming the top open-source
models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%).
Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%)
performed best. The scores of the quantized models were comparable to those of
the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM
performance on image-containing questions did not improve when the images were
provided and worsened when LLM-generated captions were provided. In contrast, a
10% increase in accuracy was observed when images were accompanied by
human-crafted image descriptions.
Conclusion: In conclusion, while LLMs exhibit robust zero-shot performance in
medical reasoning, the integration of visual data remains a challenge for VLMs.
Effective deployment involves carefully determining optimal model
configurations, encouraging users to consider either the high performance of
proprietary models or the flexible adaptability of open-source models.
comment: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File
Pages: 35, Data Transparency Statement: Code is available at:
https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology . Study data from
American College of Gastroenterology (ACG) are restricted and available upon
request with ACG permission. Correction: updated abstract considering
Llama3.1 results
♻ ☆ Towards Measuring and Modeling "Culture" in LLMs: A Survey
Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury
We present a survey of more than 90 recent papers that aim to study cultural
representation and inclusion in large language models (LLMs). We observe that
none of the studies explicitly define "culture," which is a complex,
multifaceted concept; instead, they probe the models on some specially designed
datasets which represent certain aspects of "culture". We call these aspects
the proxies of culture, and organize them across two dimensions of demographic
and semantic proxies. We also categorize the probing methods employed. Our
analysis indicates that only certain aspects of "culture," such as values and
objectives, have been studied, leaving several other interesting and important
facets, especially the multitude of semantic domains (Thompson et al., 2020)
and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps
are the lack of robustness of probing techniques and situated studies on the
impact of cultural mis- and under-representation in LLM-based applications.
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Rohan Jha, Bo Wang, Michael Günther, Georgios Mastrapas, Saba Sturua, Isabelle Mohr, Andreas Koukounas, Mohammad Kalim Akram, Nan Wang, Han Xiao
Multi-vector dense models, such as ColBERT, have proven highly effective in
information retrieval. ColBERT's late interaction scoring approximates the
joint query-document attention seen in cross-encoders while maintaining
inference efficiency closer to traditional dense retrieval models, thanks to
its bi-encoder architecture and recent optimizations in indexing and search. In
this paper, we introduce a novel architecture and a training framework to
support long context window and multilingual retrieval. Our new model,
Jina-ColBERT-v2, demonstrates strong performance across a range of English and
multilingual retrieval tasks.
comment: 8 pages, references at pp. 7-8; EMNLP workshop submission
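ColBERT's late interaction scoring mentioned above reduces to a MaxSim sum, sketched here with NumPy; the token embeddings are assumed to come from the bi-encoder, and cosine normalization is one common choice:

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding, take
    its maximum cosine similarity over all document token embeddings, then
    sum over query tokens. Inputs are (tokens, dim) matrices."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                    # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())
```

Because document token embeddings can be precomputed and indexed, this scoring keeps inference far cheaper than a cross-encoder while approximating joint query-document attention.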
♻ ☆ An Empirical Study on Information Extraction using Large Language Models
Human-like large language models (LLMs), especially the most powerful and
popular ones in OpenAI's GPT family, have proven to be very helpful for many
natural language processing (NLP) related tasks. Therefore, various attempts
have been made to apply LLMs to information extraction (IE), which is a
fundamental NLP task that involves extracting information from unstructured
plain text. To demonstrate the latest representative progress in LLMs'
information extraction ability, we assess the information extraction ability of
GPT-4 (the latest version of GPT at the time of writing this paper) from four
perspectives: Performance, Evaluation Criteria, Robustness, and Error Types.
Our results suggest a visible performance gap between GPT-4 and
state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the
LLMs' human-like characteristics, we propose and analyze the effects of a
series of simple prompt-based methods, which can be generalized to other LLMs
and NLP tasks. Rich experiments show our methods' effectiveness and some of
their remaining issues in improving GPT-4's information extraction ability.
comment: This article has an earlier arXiv version entitled "Is Information
Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation
Criteria, Robustness and Errors" (arXiv:2305.14450)
♻ ☆ Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
Artificial Intelligence (AI) is increasingly being integrated into scientific
research, particularly in the social sciences, where understanding human
behavior is critical. Large Language Models (LLMs) like GPT-4 have shown
promise in replicating human-like responses in various psychological
experiments. However, the extent to which LLMs can effectively replace human
subjects across diverse experimental contexts remains unclear. Here, we conduct
a large-scale study replicating 154 psychological experiments from top social
science journals with 618 main effects and 138 interaction effects using GPT-4
as a simulated participant. We find that GPT-4 successfully replicates 76.0
percent of main effects and 47.0 percent of interaction effects observed in the
original studies, closely mirroring human responses in both direction and
significance. However, only 19.44 percent of GPT-4's replicated confidence
intervals contain the original effect sizes, with the majority of replicated
effect sizes exceeding the 95 percent confidence interval of the original
studies. Additionally, there is a 71.6 percent rate of unexpected significant
results where the original studies reported null findings, suggesting potential
overestimation or false positives. Our results demonstrate the potential of
LLMs as powerful tools in psychological research but also emphasize the need
for caution in interpreting AI-driven findings. While LLMs can complement human
studies, they cannot yet fully replace the nuanced insights provided by human
subjects.
comment: 5 figures, 2 tables
♻ ☆ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
Recent advancements in natural language processing, particularly with large
language models (LLMs) like GPT-4, have significantly enhanced dialogue
systems, enabling them to generate more natural and fluent conversations.
Despite these improvements, challenges persist, such as managing continuous
dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024
addresses these challenges by employing the Werewolf Game, an incomplete
information game, to test the capabilities of LLMs in complex interactive
environments. This paper introduces an LLM-based Werewolf Game AI, where each
role is supported by situation analysis to aid response generation.
Additionally, for the werewolf role, various persuasion strategies, including
logical appeal, credibility appeal, and emotional appeal, are employed to
effectively persuade other players to align with its actions.
comment: Accepted to the AIWolfDial2024 workshop at INLG 2024
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For
this to be true, LLMs would need to be better at discriminating among
previously-generated alternatives than at generating initial responses. We
explore the validity of this hypothesis in practice. We first formulate a
unified framework that allows us to compare the generative and discriminative
capability of any model on any task. In our resulting experimental analysis of
several open-source and industrial LLMs, we observe that models are not
reliably better at discriminating among previously-generated alternatives than
at generating initial responses. This finding challenges the notion that LLMs
can enhance their performance solely through their own judgment.
♻ ☆ LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
Recent large language model (LLM) defenses have greatly improved models'
ability to refuse harmful queries, even when adversarially attacked. However,
LLM defenses are primarily evaluated against automated adversarial attacks in a
single turn of conversation, an insufficient threat model for real-world
malicious use. We demonstrate that multi-turn human jailbreaks uncover
significant vulnerabilities, exceeding 70% attack success rate (ASR) on
HarmBench against defenses that report single-digit ASRs with automated
single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine
unlearning defenses, successfully recovering dual-use biosecurity knowledge
from unlearned models. We compile these results into Multi-Turn Human
Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks.
We publicly release MHJ alongside a compendium of jailbreak tactics developed
across dozens of commercial red teaming engagements, supporting research
towards stronger LLM defenses.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri
Large Language Models (LLMs) are often aligned using contrastive alignment
objectives and preference pair datasets. The interaction between model, paired
data, and objective makes alignment a complicated procedure, sometimes
producing subpar results. We study this and find that (i) preference data gives
a better learning signal when the underlying responses are contrastive, and
(ii) alignment objectives lead to better performance when they specify more
control over the model during training. Based on these insights, we introduce
Contrastive Learning from AI Revisions (CLAIR), a data-creation method which
leads to more contrastive preference pairs, and Anchored Preference
Optimization (APO), a controllable and more stable alignment objective. We
align Llama-3-8B-Instruct using various comparable datasets and alignment
objectives and measure MixEval-Hard scores, which correlate highly with human
judgments. The CLAIR preferences lead to the strongest performance out of all
datasets, and APO consistently outperforms less controllable objectives. Our
best model, trained on 32K CLAIR preferences with APO, improves
Llama-3-8B-Instruct by 7.65%, closing the gap with GPT-4-turbo by 45%. Our code
is available at https://github.com/ContextualAI/CLAIR_and_APO.